Pulsars are a form of neutron star, one of the last stages of a massive star before it collapses into a black hole. Pulsars have unique properties that can help us answer questions about the fabric of space-time, the interstellar medium, and states of matter in our universe. Better detection of pulsars could deepen our understanding of the space-time continuum and the world in which we live. We hope to discover more aspects of previously unknown celestial bodies that astronomers and physicists can study to enhance our insight into the far universe.
Pulsar search involves looking for radio signals with large telescopes. However, almost all detections are caused by radio frequency interference and noise, making legitimate signals hard to find. In this project, we conduct analysis to identify the characteristics of the radio signal that can help detect the true pulsars among the pulsar candidates using decision trees, random forest, and neural networks. We also compare the prediction accuracy of these methods.
The dataset used is HTRU2, a dataset which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey.
A pulsar is a neutron star that emits beams of radiation that sweep through Earth’s line of sight. Like a black hole, it is an endpoint to stellar evolution. The “pulses” of high-energy radiation we see from a pulsar are due to a misalignment of the neutron star’s rotation axis and its magnetic axis (NASA).
Its significance lies in the information a pulsar provides: pulsars orbiting within the curved space-time around Sgr A*, the supermassive black hole at the center of the Milky Way, could serve as probes of gravity in the strong-field regime (Wikipedia). This helps us validate Einstein's general theory of relativity.
Pulsars are also potential candidates for an interstellar Global Positioning System: their regular pulses could serve as location markers.
Pulsars produce waves of different shapes, and two shapes can be similar or drastically distinct from each other, so it is hard to tell whether a source is a pulsar from mere observation. Figure one shows an aggregated graph of 100 pulses, each recorded at a certain time; all of them are emitted from a single pulsar.
The shape of each wave differs; their kurtosis and skewness vary widely. Ordinary signal processing struggles to identify possible pulsar carriers: it may yield a high false-negative rate, because pulsars have diverse characteristics and the variables in our classification and regression processes are non-parametric, i.e. distribution-free. We need to dive into these characteristics and abandon heuristics if we want high accuracy and sensitivity.
Our goal is to construct a model that maximizes accuracy in predicting pulsars, while also taking factors such as specificity and sensitivity into consideration.
We collected information on each wave's characteristics: the integrated profile and the dispersion measure (DM-SNR) curve. Each contributes a mean, standard deviation, skewness, and excess kurtosis, i.e. the first four moments of the two characteristics.
| Factors | | |
|---|---|---|
| Mean of the integrated profile | Skewness of the integrated profile | Excess kurtosis of the DM-SNR curve |
| Standard deviation of the integrated profile | Mean of the DM-SNR curve | Skewness of the DM-SNR curve |
| Excess kurtosis of the integrated profile | Standard deviation of the DM-SNR curve | target_class |
We can see clear differences between pulsars and normal stars in the mean and excess kurtosis of the integrated profile and in the standard deviation of the DM-SNR curve; clear decision boundaries can be drawn on these parameters. However, it is hard to distinguish pulsars from normal stars using only the mean and excess kurtosis of the DM-SNR curve, because the decision boundaries in the scatter plot overlap heavily.
There is high correlation between the excess kurtosis and the mean of the integrated profile, and between the skewness and the excess kurtosis of the DM-SNR curve. We want to address this collinearity through methods such as random forest.
We first convert target_class into a categorical variable: 1 means pulsar, 0 means normal star. We then split the data into train, test, and validation sets of size 1200, 3898, and 2000 respectively. We use the training set to fit our models and the validation set to tune parameters; the test set is left untouched until we evaluate the tuned models.
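As a sketch of this split (the report appears to use R; here is an equivalent Python sketch on a synthetic stand-in for the HTRU2 matrix, with fabricated feature values and class ratio):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for HTRU2: 17,898 rows, 8 features, roughly 9% pulsars.
n = 17898
X = rng.normal(size=(n, 8))
y = (rng.random(n) < 0.09).astype(int)

# Shuffle once, then carve out the three sets used in the report:
# 1200 train, 2000 validation, 3898 test (the stated sizes do not use every row).
idx = rng.permutation(n)
train_idx, val_idx, test_idx = idx[:1200], idx[1200:3200], idx[3200:7098]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```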
We use Principal Component Analysis to generate insights into the relationships between variables.
Blue means positive correlation, red means negative. The diameter of each circle scales with the strength of the correlation: the larger a circle, the stronger the correlation between the two variables. We see high correlation between the excess kurtosis and the mean of the integrated profile, between the excess kurtosis and the standard deviation of the DM-SNR curve, and between the skewness and the excess kurtosis of the DM-SNR curve. We want to address these correlations because the resulting collinearity negatively impacts model construction and validation; our models will generalize poorly if we do not handle it properly.
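The underlying computation is a plain correlation matrix (the report plots it with R's corrplot); here is a Python sketch on two synthetic features built to be collinear, plus one independent feature:

```python
import numpy as np

rng = np.random.default_rng(8)
# Stand-ins for a correlated pair (e.g. mean and excess kurtosis of the
# integrated profile) and one unrelated feature; values are fabricated.
mean_ip = rng.normal(size=500)
kurt_ip = 0.9 * mean_ip + 0.1 * rng.normal(size=500)
other = rng.normal(size=500)

# Pearson correlation matrix; corrplot draws circles sized by these entries.
corr = np.corrcoef(np.vstack([mean_ip, kurt_ip, other]))
print(np.round(corr, 2))
```

The (0, 1) entry is near 1 by construction, while the entries involving the independent feature hover near 0.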
Possible solutions include random forest models and neural network models with dropout. A random forest bags a number of trees and splits each tree on randomly selected variables, reducing collinearity and overfitting. A neural network propagates variables through an artificial network structure and employs stochastic gradient descent to minimize cross-entropy loss; we use dropout to tune the model and limit overfitting.
We also look into Principal Component Analysis. Here is a decomposition into principal components.
We used prcomp to break the data into 8 principal components. The first principal component is rather noisy; most data points are spread out in a megaphone shape, and the noise is heteroskedastic. Only the higher-order principal components are confined to a stationary process.
The first two principal components, PC1 and PC2, account for eighty percent of the variance, with PC1 alone explaining half of the cumulative variance.
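A Python sketch of the same diagnostic (the report uses R's prcomp), on synthetic features deliberately built from two latent factors so the first two components dominate, much like the ~80% figure above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Eight columns generated from two latent factors plus small noise,
# a fabricated stand-in for the correlated HTRU2 predictors.
base = rng.normal(size=(500, 2))
X = np.hstack([base + 0.1 * rng.normal(size=(500, 2)) for _ in range(4)])

pca = PCA(n_components=8).fit(X)
explained = np.cumsum(pca.explained_variance_ratio_)
# Cumulative proportion of variance captured by PC1 and PC1+PC2.
print(np.round(explained[:2], 3))
```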
We use the AUC of each variable as a measure of its importance in classifying the target class. Obtaining each variable's AUC is straightforward: we regress the target class on the variable and use the pROC package to compute the AUC.
| AUC | Predictor |
|---|---|
| 0.9498 | Mean of the integrated profile |
| 0.8127 | Standard deviation of the integrated profile |
| 0.9692 | Excess kurtosis of the integrated profile |
| 0.9394 | Skewness of the integrated profile |
| 0.8919 | Mean of the DM-SNR curve |
| 0.8944 | Standard deviation of the DM-SNR curve |
| 0.884 | Excess kurtosis of the DM-SNR curve |
| 0.8858 | Skewness of the DM-SNR curve |
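The per-variable AUC procedure can be sketched in Python (the report uses pROC in R) with synthetic data: one deliberately informative predictor and one pure-noise predictor, both fabricated for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=1000)
informative = y + rng.normal(scale=0.4, size=1000)  # shifts with the class
noise = rng.normal(size=1000)                        # unrelated to the class

# AUC of each predictor on its own, used as a univariate importance score.
for name, x in [("informative", informative), ("noise", noise)]:
    print(name, round(roc_auc_score(y, x), 3))
```

The informative predictor scores well above 0.9 while the noise predictor sits near 0.5, mirroring how the table above ranks the integrated-profile moments.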
We see that the excess kurtosis, mean, and skewness of the integrated profile have high AUC values; they are the stronger variables.
We select variables using the cross-validated lambda within one standard error of the minimum. The selected variables are the excess kurtosis of the integrated profile and the mean, standard deviation, and skewness of the DM-SNR curve. The model has an Area under the Curve of 0.9826.
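This L1-regularized selection can be sketched as follows (the report appears to use glmnet-style cross validation in R; this Python analogue uses cross-validated penalty strength on synthetic data where only the first three of eight features carry signal):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(3)
# Fabricated data: 8 features, only features 0, 1, 2 drive the label.
X = rng.normal(size=(800, 8))
y = (X[:, 0] + X[:, 1] - X[:, 2] + 0.5 * rng.normal(size=800) > 0).astype(int)

# L1-penalized logistic regression with the penalty chosen by 5-fold CV,
# analogous to lasso selection with cv.glmnet.
clf = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])  # indices of variables the lasso kept
print(selected)
```

The informative features survive the penalty; weak features are shrunk toward (or to) zero.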
We fit these variables to a logistic regression and obtain our confusion matrix. The confusion matrix looks good, but there is still a large number of false positives.
We suspect the L1 norm alone may not shrink the model properly, so we add an L2 norm to see if it improves accuracy and specificity.
We again select variables using the cross-validated lambda within one standard error of the minimum. The selected variables are the excess kurtosis of the integrated profile; the mean, standard deviation, and skewness of the DM-SNR curve; and the standard deviation of the integrated profile. Fitting these five variables to a logistic regression yields a good confusion matrix and an Area under the Curve of 0.9833.
We next consider binary decision trees, which learn simple threshold rules on the predictors and produce an interpretable classifier.
We obtain an ROC curve and a confusion matrix plot. The Area under the Curve is 0.9805.
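A minimal decision-tree sketch in Python (the report's R pipeline uses caret; here the label follows a fabricated single-threshold rule so a shallow tree can recover it):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 8))
y = (X[:, 2] > 0.5).astype(int)  # invented threshold rule, for illustration only

# A shallow binary tree learns axis-aligned thresholds directly.
tree = DecisionTreeClassifier(max_depth=3).fit(X[:700], y[:700])
proba = tree.predict_proba(X[700:])[:, 1]
print(round(roc_auc_score(y[700:], proba), 3))
```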
Models using bootstrap aggregation can greatly help reduce collinearity, so we experimented with a random forest. Its performance is good, though the AUC is slightly smaller. Starting from a forest of 500 trees, we choose 100 after finding the point where the OOB error levels off.
We set num.tree=100 and tune mtry at that setting. The error is rather stochastic across values of mtry, so we choose mtry=5 to include more predictors. This may be an issue: high volatility of the error across values of mtry suggests the variables are highly correlated. We address that problem later with a neural network.
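The forest configuration above translates roughly as follows (the report uses R's randomForest; in scikit-learn, `n_estimators` plays the role of num.tree and `max_features` the role of mtry; the data here is fabricated):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # invented additive rule

# 100 trees, 5 candidate features per split; oob_score_ is the out-of-bag
# accuracy used in the report to decide when more trees stop helping.
rf = RandomForestClassifier(n_estimators=100, max_features=5, oob_score=True,
                            random_state=0).fit(X, y)
print(round(rf.oob_score_, 3))
```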
We list the variable importances in the table below.
| Variable | Importance |
|---|---|
| Mean of the integrated profile | 234.09305 |
| Standard deviation of the integrated profile | 67.64586 |
| Excess kurtosis of the integrated profile | 1030.49744 |
| Skewness of the integrated profile | 456.09285 |
| Mean of the DM-SNR curve | 56.98666 |
| Standard deviation of the DM-SNR curve | 78.58280 |
| Excess kurtosis of the DM-SNR curve | 45.84543 |
| Skewness of the DM-SNR curve | 44.54817 |
The random forest with these parameters has an Area under the Curve of 0.9825.
For gradient boosting we used 7-fold cross validation, 100 trees, and an interaction.depth of 2.
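A Python sketch of that configuration (the report appears to use R's gbm; `max_depth` here stands in for interaction.depth, and the data is fabricated):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # invented rule for illustration

# 100 boosted trees of depth 2, scored with 7-fold cross validation.
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=2)
scores = cross_val_score(gbm, X, y, cv=7)
print(round(scores.mean(), 3))
```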
We used Keras on TensorFlow to build our neural network. We built fully connected models with 2, 5, and 10 layers, using dropout in layers 5 and 7. We start with an overfitting network and use the Adam optimizer to improve the efficiency and effect of gradient descent, adding dropout layers to prevent unnecessary overfitting. We arrive at a 7-layer model at the end of this procedure.
## Model
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_1 (Dense) (None, 8) 72
## ___________________________________________________________________________
## dense_2 (Dense) (None, 8) 72
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 8) 0
## ___________________________________________________________________________
## dense_3 (Dense) (None, 8) 72
## ___________________________________________________________________________
## dense_4 (Dense) (None, 8) 72
## ___________________________________________________________________________
## dropout_2 (Dropout) (None, 8) 0
## ___________________________________________________________________________
## dense_5 (Dense) (None, 8) 72
## ___________________________________________________________________________
## dense_6 (Dense) (None, 6) 54
## ___________________________________________________________________________
## dense_7 (Dense) (None, 1) 7
## ===========================================================================
## Total params: 421
## Trainable params: 421
## Non-trainable params: 0
## ___________________________________________________________________________
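The printed summary corresponds to a model like the following Keras sketch (the report appears to use the R interface; activations and the dropout rate are assumptions, since the summary does not record them):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Rebuild the architecture from the summary: seven Dense layers on the eight
# HTRU2 features, with dropout after the second and fourth hidden layers.
model = keras.Sequential([
    layers.Dense(8, activation="relu"),    # dense_1..dense_5: 8*8 + 8 = 72 params each
    layers.Dense(8, activation="relu"),
    layers.Dropout(0.2),                   # dropout rate is an assumption
    layers.Dense(8, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(8, activation="relu"),
    layers.Dense(6, activation="relu"),    # dense_6: 8*6 + 6 = 54 params
    layers.Dense(1, activation="sigmoid"), # dense_7: 6*1 + 1 = 7 params
])
model.build(input_shape=(None, 8))         # 8 input features
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
print(model.count_params())                # 421, matching the summary
```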
We obtain an Area under the Curve of 0.9794.
We also used AutoKeras to create a neural network that tunes itself. The Python implementation of AutoKeras is straightforward, and the resulting network is much simpler than ours with a higher specificity.
Based on the training and validation data, we optimized our models; their performance is collected in the table below.
| Model | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|
| L1 Norm Logistic | 0.8721 | 0.9956 | 0.7485 | 0.9826 |
| L1, L2 Norms Logistic | 0.9755 | 0.9956 | 0.7602 | 0.9833 |
| Decision Tree | 0.9344 | 0.9936 | 0.8746 | 0.9805 |
| Random Forest | 0.9765 | 0.9942 | 0.8746 | 0.9825 |
| Gradient Boosting | 0.977 | 0.9945 | 0.7895 | 0.9681 |
| Neural Network | 0.9725 | 0.9934 | 0.7485 | 0.9794 |
If we want to maximize accuracy, we would choose the Gradient Boosting Machine, with an accuracy of 0.977, but we are not satisfied with its specificity. If we instead choose a model on a metric that prioritizes specificity, we would choose the random forest, which also has among the highest Areas under the Curve.
We use test data to check their performance.
| Model | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| L1 Norm Logistic | 0.9795 | 0.9946 | 0.8296 |
| Decision Tree | 0.9797 | 0.9921 | 0.8575 |
| Random Forest | 0.9823 | 0.9944 | 0.8631 |
| Gradient Boosting | 0.9769 | 0.9949 | 0.7989 |
| Neural Network | 0.9812 | 0.9932 | 0.7501 |
As a result, the neural network and the random forest generalize well, with high accuracy.
Since confirming a candidate pulsar with rigorous scientific methods rather than simple inference takes years, it is better to choose a model with good specificity: we do not want to waste resources validating a candidate that we predicted to be a pulsar but is not.
Our final model is a random forest. However, we suggest a method similar to boosting, except that we bag or boost whole models instead of trees.
We choose a combination of the random forest and the neural network to classify stars. If both report 0, we classify the candidate as a non-pulsar; if both report 1, as a pulsar. If they disagree, we use the L1-regularized logistic regression (tuned by cross validation) to settle its status. Combining these predictions, we can reduce the false negative rate to as low as 0.01.
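The voting rule described above can be sketched directly (a minimal Python sketch; the 0/1 label vectors below are fabricated examples):

```python
import numpy as np

def ensemble_predict(rf_pred, nn_pred, logit_pred):
    """Agree -> take the shared label; disagree -> defer to the
    L1-regularized logistic regression as the tie-breaker."""
    rf_pred, nn_pred, logit_pred = map(np.asarray, (rf_pred, nn_pred, logit_pred))
    return np.where(rf_pred == nn_pred, rf_pred, logit_pred)

# Candidates 3 and 4 are disagreements, settled by the logistic model.
print(ensemble_predict([1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0]))  # -> [1 0 1 0]
```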
We have several ways to combine them. First, we can combine them with equal weights. The ensemble then produces a false negative only when both the random forest and the neural network miss a true pulsar, or when they disagree and the logistic regression misses it:
\[FNR_{ensemble} = FNR_{rf}\,FNR_{nn} + \left(TPR_{rf}\,FNR_{nn} + TPR_{nn}\,FNR_{rf}\right)FNR_{logistic}\]
We can also assign weights deliberately based on each model's MSE.
We can also choose more models and use a non-parametric test such as the Wilcoxon rank-sum (Mann-Whitney) test to check significance against \(H_0: y = 0\). We rank the candidate models in order of accuracy and perform the test on their numeric probabilities.
We would like to look directly at the waves and dive deeper into their shapes using a convolutional neural network. Subtle changes in shape that alter skewness or kurtosis could be picked up by convolutional layers that detect certain irregularities in the waves.
We would also use a Kalman filter or exponential smoothing to smooth out the white noise. After running the Kolmogorov-Smirnov test and Bartlett's test for white noise, several waves are close to white noise, with p-values around 0.25, which are marginally insignificant. We want to use signal processing to process the given signals first and remove a certain level of noise. An example would be applying a Kalman filter to the following wave:
(Figure source: Wikipedia)
We would carefully select a dampening threshold and decay factor to minimize information loss and the rough approximation introduced by filtering. If we overdamp the signal, we lose information; we want a threshold that keeps the waves intact while smoothing out volatile curvature.
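The exponential-smoothing variant of this idea can be sketched as follows (a Python sketch on a fabricated pulse profile; the decay factor `alpha` plays the role of the dampening parameter discussed above):

```python
import numpy as np

def exponential_smoothing(signal, alpha=0.3):
    """Simple exponential smoothing; alpha is the decay factor.
    Smaller alpha damps noise harder but risks washing out the pulse shape."""
    smoothed = np.empty(len(signal), dtype=float)
    smoothed[0] = signal[0]
    for t in range(1, len(signal)):
        smoothed[t] = alpha * signal[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

rng = np.random.default_rng(7)
t = np.linspace(0, 1, 200)
pulse = np.exp(-((t - 0.5) ** 2) / 0.002)        # synthetic clean pulse profile
noisy = pulse + 0.2 * rng.normal(size=t.size)    # pulse buried in white noise
smooth = exponential_smoothing(noisy, alpha=0.3)

# The smoothed wave should sit closer to the clean pulse than the noisy one.
print(np.mean((smooth - pulse) ** 2) < np.mean((noisy - pulse) ** 2))
```

With an overly small `alpha` the filter would overdamp the pulse itself, which is exactly the information-loss trade-off noted above.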